1,143 research outputs found

    A map of human genome variation from population-scale sequencing

    The 1000 Genomes Project aims to provide a deep characterization of human genome sequence variation as a foundation for investigating the relationship between genotype and phenotype. Here we present results of the pilot phase of the project, designed to develop and compare different strategies for genome-wide sequencing with high-throughput platforms. We undertook three projects: low-coverage whole-genome sequencing of 179 individuals from four populations; high-coverage sequencing of two mother–father–child trios; and exon-targeted sequencing of 697 individuals from seven populations. We describe the location, allele frequency and local haplotype structure of approximately 15 million single nucleotide polymorphisms, 1 million short insertions and deletions, and 20,000 structural variants, most of which were previously undescribed. We show that, because we have catalogued the vast majority of common variation, over 95% of the currently accessible variants found in any individual are present in this data set. On average, each person is found to carry approximately 250 to 300 loss-of-function variants in annotated genes and 50 to 100 variants previously implicated in inherited disorders. We demonstrate how these results can be used to inform association and functional studies. From the two trios, we directly estimate the rate of de novo germline base substitution mutations to be approximately 10⁻⁸ per base pair per generation. We explore the data with regard to signatures of natural selection, and identify a marked reduction of genetic variation in the neighbourhood of genes, due to selection at linked sites. These methods and public data will support the next phase of human genetic research.
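As a rough illustration of what the quoted mutation rate implies, the sketch below multiplies the rate from the abstract by a genome size. The haploid genome length of about 3.1 billion base pairs is an assumption that does not come from the abstract, and the function name is hypothetical; the result is an order-of-magnitude figure, not a value reported by the study.

```python
# Rate quoted in the abstract: about 1e-8 substitutions per base pair per generation.
MUTATION_RATE_PER_BP = 1e-8

# Assumption (not from the abstract): haploid human genome of roughly 3.1e9 bp.
HAPLOID_GENOME_BP = 3.1e9


def expected_de_novo_substitutions(rate_per_bp, haploid_genome_bp):
    """Expected new substitutions in one child, counting both inherited genome copies."""
    diploid_bp = 2 * haploid_genome_bp
    return rate_per_bp * diploid_bp


if __name__ == "__main__":
    expected = expected_de_novo_substitutions(MUTATION_RATE_PER_BP, HAPLOID_GENOME_BP)
    print(f"Expected de novo substitutions per generation: about {expected:.0f}")
```

Under these assumptions the estimate comes to a few tens of new substitutions per child, which is why trio sequencing at high coverage is needed to observe them directly.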

    An integrated map of genetic variation from 1,092 human genomes

    By characterizing the geographic and functional spectrum of human genetic variation, the 1000 Genomes Project aims to build a resource to help to understand the genetic contribution to disease. Here we describe the genomes of 1,092 individuals from 14 populations, constructed using a combination of low-coverage whole-genome and exome sequencing. By developing methods to integrate information across several algorithms and diverse data sources, we provide a validated haplotype map of 38 million single nucleotide polymorphisms, 1.4 million short insertions and deletions, and more than 14,000 larger deletions. We show that individuals from different populations carry different profiles of rare and common variants, and that low-frequency variants show substantial geographic differentiation, which is further increased by the action of purifying selection. We show that evolutionary conservation and coding consequence are key determinants of the strength of purifying selection, that rare-variant load varies substantially across biological pathways, and that each individual contains hundreds of rare non-coding variants at conserved sites, such as motif-disrupting changes in transcription-factor-binding sites. This resource, which captures up to 98% of accessible single nucleotide polymorphisms at a frequency of 1% in related populations, enables analysis of common and low-frequency variants in individuals from diverse, including admixed, populations

    A global reference for human genetic variation

    The 1000 Genomes Project set out to provide a comprehensive description of common human genetic variation by applying whole-genome sequencing to a diverse set of individuals from multiple populations. Here we report completion of the project, having reconstructed the genomes of 2,504 individuals from 26 populations using a combination of low-coverage whole-genome sequencing, deep exome sequencing, and dense microarray genotyping. We characterized a broad spectrum of genetic variation, in total over 88 million variants (84.7 million single nucleotide polymorphisms (SNPs), 3.6 million short insertions/deletions (indels), and 60,000 structural variants), all phased onto high-quality haplotypes. This resource includes >99% of SNP variants with a frequency of >1% for a variety of ancestries. We describe the distribution of genetic variation across the global sample, and discuss the implications for common disease studies

    The landscape of human STR variation

    Short tandem repeats are among the most polymorphic loci in the human genome. These loci play a role in the etiology of a range of genetic diseases and have been frequently utilized in forensics, population genetics, and genetic genealogy. Despite this plethora of applications, little is known about the variation of most STRs in the human population. Here, we report the largest-scale analysis of human STR variation to date. We collected information for nearly 700,000 STR loci across more than 1000 individuals in Phase 1 of the 1000 Genomes Project. Extensive quality controls show that reliable allelic spectra can be obtained for close to 90% of the STR loci in the genome. We utilize this call set to analyze determinants of STR variation, assess the human reference genome’s representation of STR alleles, find STR loci with common loss-of-function alleles, and obtain initial estimates of the linkage disequilibrium between STRs and common SNPs. Overall, these analyses further elucidate the scale of genetic variation beyond classical point mutations. Funding: American Society for Engineering Education; National Defense Science and Engineering Graduate Fellowship.

    Investigating genome-wide transcriptional and methylomic consequences of a balanced t(1;11) translocation linked to major mental illness

    Schizophrenia, bipolar disorder and major depressive disorder are devastating psychiatric conditions with a complex, overlapping genetic and environmental architecture. Previously, a family has been reported where a balanced chromosomal translocation between chromosomes 1 and 11 [t(1;11)] shows significant linkage to these disorders. This translocation transects three genes: Disrupted in schizophrenia-1 (DISC1) on chromosome 1, a non-coding RNA, Disrupted in schizophrenia-2 (DISC2) antisense to DISC1, and a non-coding transcript, DISC1 fusion partner-1 (DISC1FP1) on chromosome 11, all of which could result in pathogenic properties in the context of the translocation. This thesis focuses on the genome-wide effects of the t(1;11) translocation, primarily examining differences in gene expression and DNA methylation, using various biological samples from the t(1;11) family. To assess the genome-wide effects of the t(1;11) translocation on methylation, DNA methylation was profiled in whole blood from 41 family members using the Infinium HumanMethylation450 BeadChip. Significant differential methylation was observed within the translocation breakpoint regions on chromosomes 1 and 11. Downstream analysis identified additional regions of differential methylation outwith these chromosomes, while pathway analysis showed terms related to psychiatric disorders and neurodevelopment were enriched amongst differentially methylated genes, in addition to more general terms pertaining to cellular function. Using induced pluripotent stem cell (iPSC) technology, neuronal samples were developed from fibroblasts in a subset of individuals profiled for genome-wide methylation in whole blood (N = 6) with an aim to replicate the significant findings around the breakpoint regions. Here, methylation was profiled using the Infinium HumanMethylation450 BeadChip’s successor: the Infinium MethylationEPIC BeadChip.
The results from the blood-based study failed to replicate in the neuronal samples, which could be attributed to low statistical power or tissue-specific factors such as methylation quantitative trait loci. The differences in methylation in the most significantly differentially methylated loci were found to be driven by a single individual, rendering further interpretation of the findings from this analysis difficult without additional samples. Cross-tissue analyses of DNA methylation were performed on blood and neuronal DNA from these six individuals, revealing little correlation between cell types. DISC1 is central to a network of interacting protein partners, including the transcription factor ATF4, and PDE4; both of which are associated with the cAMP signalling pathway. Haploinsufficiency of DISC1 due to the translocation may therefore be disruptive to cAMP-mediated gene expression. In order to identify transcriptomic effects which may be related to the t(1;11) translocation, genome-wide expression profiling was performed in lymphoblastoid cell line RNA from 13 family members. No transcripts were found to be differentially expressed at the genome-wide significant level. A post-hoc power analysis suggested that more samples would be required in order to detect genome-wide significant differential expression. However, imposing a fold-change cut-off to the data identified a number of candidate genes for follow-up analysis, including SORL1: a member of the brain-expressed Sortilin gene family. Sortilin genes have been linked to multiple psychiatric disorders including schizophrenia, bipolar disorder and Alzheimer’s disease. Follow-up analyses of Sortilin family members were performed in a Disc1 mouse model of schizophrenia, containing an amino acid substitution (L100P). Here, developmental gene expression profiling was performed with an additional aim to optimise and validate work performed by others using this mouse model. 
However, results from these experiments were variable between the two independent batches of mice tested. Additional investigation of Sortilin family genes was performed using GWAS data from human samples, using machine learning techniques to identify epistatic interactions linked to depression and brain function, revealing no statistically significant interactions. The results presented in this thesis suggest a potential mechanism for differential DNA methylation in the context of chromosomal translocations, and suggest mechanisms whereby increased risk of illness is conferred upon translocation carriers through dysregulation of transcription and DNA methylation.

    HapZipper: sharing HapMap populations just got easier

    The rapidly growing amount of genomic sequence data being generated and made publicly available necessitates the development of new data storage and archiving methods. The vast amount of data being shared and manipulated also creates new challenges for network resources. Thus, developing advanced data compression techniques is becoming an integral part of data production and analysis. The HapMap project is one of the largest public resources of human single-nucleotide polymorphisms (SNPs), characterizing over 3 million SNPs genotyped in over 1000 individuals. The standard format and biological properties of HapMap data suggest that a dedicated genetic compression method can outperform generic compression tools. We propose a compression methodology for genetic data by introducing HapZipper, a lossless compression tool tailored to compress HapMap data beyond benchmarks defined by generic tools such as gzip, bzip2 and lzma. We demonstrate the usefulness of HapZipper by compressing HapMap 3 populations to <5% of their original sizes. HapZipper is freely downloadable from https://bitbucket.org/pchanda/hapzipper/downloads/HapZipper.tar.bz
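The generic-tool baseline mentioned in this abstract can be sketched with Python's standard library. This is a minimal illustration only, not the HapZipper method: the synthetic 0/1 haplotype matrix, the function names, and the parameter values are all assumptions made for the example, and real HapMap files would give different ratios.

```python
import bz2
import gzip
import lzma
import random


def make_synthetic_genotypes(n_snps=2000, n_haplotypes=100, seed=0):
    """Build a toy 0/1 haplotype matrix as text (hypothetical stand-in for HapMap data)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_snps):
        # Each SNP gets its own minor-allele frequency between 1% and 50%.
        maf = rng.uniform(0.01, 0.5)
        rows.append("".join("1" if rng.random() < maf else "0" for _ in range(n_haplotypes)))
    return "\n".join(rows).encode("ascii")


def compression_ratios(data):
    """Return compressed size divided by original size for each generic compressor."""
    compressors = {
        "gzip": gzip.compress,
        "bzip2": bz2.compress,
        "lzma": lzma.compress,
    }
    return {name: len(compress(data)) / len(data) for name, compress in compressors.items()}


if __name__ == "__main__":
    data = make_synthetic_genotypes()
    for name, ratio in compression_ratios(data).items():
        print(f"{name}: {ratio:.1%} of original size")
```

A dedicated tool would be judged against ratios like these; the abstract reports that HapZipper reaches under 5% of the original size on HapMap 3 populations.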

    A bi-objective feature selection algorithm for large omics datasets

    Special Issue: Fourth special issue on knowledge discovery and business intelligence. Feature selection is one of the most important concepts in data mining when dimensionality reduction is needed. The performance measures of feature selection encompass predictive accuracy and result comprehensibility. Consistency based methods are a significant category of feature selection research that substantially improves the comprehensibility of the result using the parsimony principle. In this work, the bi-objective version of the algorithm Logical Analysis of Inconsistent Data is applied to large volumes of data. In order to deal with hundreds of thousands of attributes, heuristic decomposition uses parallel processing to solve a set covering problem and a cross-validation technique. The bi-objective solutions contain the number of reduced features and the accuracy. The algorithm is applied to omics datasets with genome-like characteristics of patients with rare diseases. The authors would like to thank the FCT support UID/Multi/04046/2013. This work used the EGI, European Grid Infrastructure, with the support of the IBERGRID, Iberian Grid Infrastructure, and INCD (Portugal).

    Sequence data of six unusual alleles at SE33 and D1S1656 STR Loci

    When profiling a reference dataset of 500 DNA samples for the population of Saudi Arabia, using the GlobalFiler® PCR amplification kit, six unusual alleles were detected. At the SE33 locus, four novel alleles were found: 2, 14.3, 20.3, and 38. Two alleles at the D1S1656 locus, 7 and 8, had been previously reported, but no published sequence data were available. The D1S1656 alleles were sequenced using ForenSeq™ DNA Signature Prep with the MiSeq FGx System (Illumina, USA). As SE33 is not reported by available Massively Parallel Sequencing (MPS) systems, samples that exhibited the unreported alleles were sequenced using the BigDye™ Terminator v3.1 Cycle Sequencing Kit. Here we present the sequence and structure of the previously uncharacterized alleles.

    Mining data from 1000 genomes to identify the causal variant in regions under positive selection

    The human genome contains hundreds of regions in which the patterns of genetic variation indicate recent positive natural selection, yet for most of these the underlying gene and the advantageous mutation remain unknown. We recently reported the development of a method, Composite of Multiple Signals (CMS), that combines tests for multiple signals of natural selection and increases resolution by up to 100-fold